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Abstract 

We introduce a Gaussian process model of functions which are additive. An addi- 
tive function is one which decomposes into a sum of low-dimensional functions, 
each depending on only a subset of the input variables. Additive GPs general- 
ize both Generalized Additive Models, and the standard GP models which use 
squared-exponential kernels. Hyperparameter learning in this model can be seen 
as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but 
tractable parameterization of the kernel function, which allows efficient evalua- 
tion of all input interaction terms, whose number is exponential in the input di- 
mension. The additional structure discoverable by this model results in increased 
interpretability, as well as state-of-the-art predictive power in regression tasks. 



1 Introduction 

Most statistical regression models in use today are of the form: g{y) = f{xi)-\-f{x2)-\ \-f{xD)- 

Popular examples include logistic regression, linear regression, and Generalized Linear Models |T]- 
This family of functions, known as Generalized Additive Models (GAM) |[2|, are typically easy 
to fit and interpret. Some extensions of this family, such as smoothing- splines ANOVA |3|, add 
terms depending on more than one variable. However, such models generally become intractable 
and difficult to fit as the number of terms increases. 

At the other end of the spectrum are kernel-based models, which typically allow the response to 
depend on all input variables simultaneously. These have the form: y = /(xi,X2,...,xi)). A 
popular example would be a Gaussian process model using a squared-exponential (or Gaussian) 
kernel. We denote this model as SE-GP. This model is much more flexible than the GAM, but its 
flexibility makes it difficult to generalize to new combinations of input variables. 

In this paper, we introduce a Gaussian process model that generalizes both GAMs and the SE-GP. 
This is achieved through a kernel which allow additive interactions of all orders, ranging from first 
order interactions (as in a GAM) all the way to I^th-order interactions (as in a SE-GP). Although 
this kernel amounts to a sum over an exponential number of terms, we show how to compute this 
kernel efficiently, and introduce a parameterization which limits the number of hyperparameters to 
0{D). A Gaussian process with this kernel function (an additive GP) constitutes a powerful model 
that allows one to automatically determine which orders of interaction are important. We show 
that this model can significantly improve modeling efficacy, and has major advantages for model 
interpretability. This model is also extremely simple to implement, and we provide example code. 

We note that a similar breakthrough has recently been made, called Hierarchical Kernel Learning 
(HKL) |4 |. HKL explores a similar class of models, and sidesteps the possibly exponential num- 
ber of interaction terms by cleverly selecting only a tractable subset. However, this method suffers 
considerably from the fact that cross-validation must be used to set hyperparameters. In addition, 
the machinery necessary to train these models is immense. Finally, on real datasets, HKL is outper- 
formed by the standard SE-GP |[4|. 
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Figure 1: A first-order additive kernel, and a product kernel. Left: a draw from a first-order additive 
kernel corresponds to a sum of draws from one-dimensional kernels. Right: functions drawn from a 
product kernel prior have weaker long-range dependencies, and less long-range structure. 



2 Gaussian Process Models 



Gaussian processes are a flexible and tractable prior over functions, useful for solving regression 
and classification tasks |5|. The kind of structure which can be captured by a GP model is mainly 
determined by its kernel, the covariance function. One of the main difficulties in specifying a 
Gaussian process model is in choosing a kernel which can represent the structure present in the data. 
For small to medium- sized datasets, the kernel has a large impact on modeling efficacy. 

Figure [T] compares, for two-dimensional functions, a first-order additive kernel with a second-order 
kernel. We can see that a GP with a first-order additive kernel is an example of a GAM: Each 
function drawn from this model is a sum of orthogonal one-dimensional functions. Compared to 
functions drawn from the higher-order GP, draws from the first-order GP have more long-range 
structure. 

We can expect many natural functions to depend only on sums of low-order interactions. For ex- 
ample, the price of a house or car will presumably be well approximated by a sum of prices of 
individual features, such as a sun-roof. Other parts of the price may depend jointly on a small set of 
features, such as the size and building materials of a house. Capturing these regularities will mean 
that a model can confidently extrapolate to unseen combinations of features. 



3 Additive Kernels 



We now give a precise definition of additive kernels. We first assign each dimension i G {1 . . . D} 
a one-dimensional base kernel ki{xi, x'-). We then define the first order, second order and nth order 
additive kernel as: 
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where D is the dimension of our input space, and is the variance assigned to all nth order 
interactions. The nth covariance function is a sum of (^) terms. In particular, the I^th order additive 
CO variance function has (^) = 1 term, a product of each dimension's covariance function: 

D 

kaddn (X, x') = Cr|) ]^ kd{Xd, x'^) (4) 

In the case where each base kernel is a one-dimensional squared-exponential kernel, the I^th-order 
term corresponds to the multivariate squared-exponential kernel: 



^ ^ / (Xrl X^ 1^ \ / (Xrl X 
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(5) 

also commonly known as the Gaussian kernel. The full additive kernel is a sum of the additive 
kernels of all orders. 



3. 1 Parameterization 



The only design choice necessary in specifying an additive kernel is the selection of a one- 
dimensional base kernel for each input dimension. Any parameters (such as length-scales) of the 
base kernels can be learned as usual by maximizing the marginal likelihood of the training data. 

In addition to the hyperparameters of each dimension- wise kernel, additive kernels are equipped 
with a set of D hyperparameters . . . cr|) controlling how much variance we assign to each or- 
der of interaction. These "order variance" hyperparameters have a useful interpretation: The dth 
order variance hyperparameter controls how much of the target function's variance comes from in- 
teractions of the d\h order. Table [T] shows examples of normalized order variance hyperparameters 
learned on real datasets. 



Table 1: Relative variance contribution of each order in the additive model, on different datasets. Here, the 
maximum order of interaction is set to 10, or smaller if the input dimension less than 10. Values are normalized 
to sum to 100. 



Order of interaction 


1st 


2nd 


3rd 


4th 


5th 


6th 


7th 


8th 


9th 


10th 


pima 


0.1 


0.1 


0.1 


0.3 


1.5 


96.4 


1.4 


0.0 






liver 


0.0 


0.2 


99.7 


0.1 


0.0 


0.0 










heart 


77.6 


0.0 


0.0 


0.0 


0.1 


0.1 


0.1 


0.1 


0.1 


22.0 


concrete 


70.6 


13.3 


13.8 


2.3 


0.0 


0.0 


0.0 


0.0 






pumadyn-8nh 


0.0 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


99.5 






servo 


58.7 


27.4 


0.0 


13.9 














housing 


0.1 


0.6 


80.6 


1.4 


1.8 


0.8 


0.7 


0.8 


0.6 


12.7 



On different datasets, the dominant order of interaction estimated by the additive model varies 
widely. An additive GP with all of its variance coming from the 1st order is equivalent to a GAM; 
an additive GP with all its variance coming from the I^th order is equivalent to a SE-GP. 

Because the hyperparameters can specify which degrees of interaction are important, the additive 
GP is an extremely general model. If the function we are modeling is decomposable into a sum of 
low-dimensional functions, our model can discover this fact and exploit it (see Figure |5]) . If this is 
not the case, the hyperparameters can specify a suitably flexible model. 

3.2 Interpretability 

As noted by Plate |6|, one of the chief advantages of additive models such as GAM is their inter- 
pretability. Plate also notes that by allowing high-order interactions as well as low-order interactions, 
one can trade off interpretability with predictive accuracy. In the case where the hyperparameters in- 
dicate that most of the variance in a function can be explained by low-order interactions, it is useful 
and easy to plot the corresponding low-order functions, as in Figure [2] 
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Figure 2: Low-order functions on the concrete dataset. Left, Centre: By considering only first-order 
terms of the additive kernel, we recover a form of Generalized Additive Model, and can plot the 
corresponding 1 -dimensional functions. Green points indicate the original data, blue points are data 
after the mean contribution from the other dimensions' first-order terms has been subtracted. The 
black line is the posterior mean of a GP with only one term in its kernel. Right: The posterior mean 
of a GP with only one second-order term in its kernel. 



3.3 Efficient Evaluation of Additive Kernels 

An additive kernel over D inputs with interactions up to order n has 0(2^) terms. Naively summing 
over these terms quickly becomes intractable. In this section, we show how one can evaluate the sum 
over all terms 'mO{D^). 

The nth order additive kernel corresponds to the nth elementary symmetric polynomial ||7| |[8|, which 
we denote e^. For example: if x has 4 input dimensions {D = 4), and if we let Zi = ki{xi^x'-), then 

kaddA^^^) = 61(2^1,^2,^3,^4) = ^1 + ^2 + ^3 + ^4 

kadd2{^i^) = 62(2:1,^2,^3,^4) = ^1^2 + ZiZ^ + Z1Z4 + 2:22:3 + 2:2^4 + ^3^4 
kaddai^^^') = 63(2:1,^2,^3,2:4) = ^1^2^3 + 2:12:22:4 + ^l^3^4 + 2:22:32:4 
kaddA^,^) = 64(2:1,^2,^3,2:4) = ^1^2^32:4 

The Newton-Girard formulae give an efficient recursive form for computing these polynomials. If 
we define Sk to be the kth power sum: s/e(2:i, 2:2, . . . , 2:^) = X]2=i ^^^^ 

1 

kadd^y^.y^') = en{zi,...,ZD) = - V'(-l)^^"^^en-fe(2:i, 2:^)5^(^1, ^d) (6) 

n 

k=l 

Where eo = 1. The Newton-Girard formulae have time complexity 0{D'^), while computing a sum 
over an exponential number of terms. 

Conveniently, we can use the same trick to efficiently compute all of the necessary derivatives of the 
additive kernel with respect to the base kernels. We merely need to remove the kernel of interest 
from each term of the polynomials: 

= en-i(^i, . . . , 2:^-1, 2:^+1, . . . zd) \l) 

This trick allows us to optimize the base kernel hyperparameters with respect to the marginal likeli- 
hood. 

3.4 Computation 

The computational cost of evaluating the Gram matrix of a product kernel (such as the SE kernel) is 
0{N^D\ while the cost of evaluating the Gram matrix of the additive kernel is 0{N'^DR), where 
R is the maximum degree of interaction allowed (up to D). In higher dimensions, this can be a 
significant cost, even relative to the fixed 0{N^) cost of inverting the Gram matrix. However, as 
our experiments show, typically only the first few orders of interaction are important for modeling a 
given function; hence if one is computationally limited, one can simply limit the maximum degree 
of interaction without losing much accuracy. 
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Figure 3: A comparison of different models. Nodes represent different interaction terms, ranging 
from first-order to fourth-order interactions. Far left: HKL can select a hull of interaction terms, but 
must use a pre-determined weighting over those terms. Far right: the additive GP model can weight 
each order of interaction seperately. Neither the HKL nor the additive model dominate one another 
in terms of flexibility, however the GP-GAM and the SE-GP are special cases of additive GPs. 



Additive Gaussian processes are particularly appealing in practice because their use requires only 
the specification of the base kernel. All other aspects of GP inference remain the same. All of the 
experiments in this paper were performed using the standard GPML toolbo:x[^ code to perform all 
experiments is available at the author's website r] 



4 Related Work 



Plate IS] constructs a form of additive GP, but using only the first-order and I^th order terms. This 
model is motivated by the desire to trade off the interpretability of first-order models, with the flexi- 
bility of full-order models. Our experiments show that often, the intermediate degrees of interaction 
contribute most of the variance. 

A related functional ANOVA GP model ||9| decomposes the mean function into a weighted sum of 
GPs. However, the effect of a particular degree of interaction cannot be quantified by that approach. 
Also, computationally, the Gibbs sampling approach used in | 9 1 is disadvantageous. 

Christoudias et al. 1 10] previously showed how mixtures of kernels can be learnt by gradient descent 
in the Gaussian process framework. They call this Bayesian localized multiple kernel learning. 
However, their approach learns a mixture over a small, fixed set of kernels, while our method learns 
a mixture over all possible products of those kernels. 



4.1 Hierarchical Kernel Learning 

Bach |4| uses a regularized optimization framework to learn a weighted sum over an exponential 
number of kernels which can be computed in polynomial time. The subsets of kernels considered by 
this method are restricted to be a hull of kernels |^ Given each dimension's kernel, and a pre-defined 
weighting over all terms, HKL performs model selection by searching over hulls of interaction 
terms. In |4 |, Bach also fixes the relative weighting between orders of interaction with a single term 
a, computing the sum over all orders by: 

D 

^a(x, x') = v|) ]^ (1 + akd{xd, x^)) (8) 

which has computational complexity 0{D). However, this formulation forces the weight of all nth 
order terms to be weighted by . 

Figure [3] contrasts the HKL hull- selection method with the Additive GP hyperparameter-learning 
method. Neither method dominates the other in flexibility. The main difficulty with the approach 

^Available at http : //www . gaussianprocess . org/gpml/code/ 
^Example code available at: http : //mlg . eng . cam. ac . uk/duvenaud/ 

^In the setting we are considering in this paper, a hull can be defined as a subset of all terms such that if term 
YljeJ ^ ) included in the subset, then so are all terms Hjej/i (^5 ^ )' i G J. For details, 
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of Q is that hyperparameters are hard to set other than by cross-vaHdation. In contrast, our method 
optimizes the hyperparameters of each dimension's base kernel, as well as the relative weighting of 
each order of interaction. 

4.2 ANOVA Procedures 

Vapnik |fTTl introduces the support vector ANOVA decomposition, which has the same form as our 
additive kernel. However, they recommend approximating the sum over all D orders with only one 
term "of appropriate order", presumably because of the difficulty of setting the hyperparameters of 
an SVM. Stitson et al. |fT2| performed experiments which favourably compared the support vector 
ANOVA decomposition to polynomial and spline kernels. They too allowed only one order to be 
active, and set hyperparameters by cross-validation. 

A closely related procedure from the statistics literature is smoothing-splines ANOVA (SS-ANOVA) 
|[3|. An SS-ANOVA model is estimated as a weighted sum of splines along each dimension, plus 
a sum of splines over all pairs of dimensions, all triplets, etc, with each individual interaction term 
having a separate weighting parameter. Because the number of terms to consider grows exponen- 
tially in the order, in practice, only terms of first and second order are usually considered. Learning 
in SS-ANOVA is usually done via penalized-maximum likelihood with a fixed sparsity hyperparam- 
eter. 

In contrast to these procedures, our method can easily include all D orders of interaction, each 
weighted by a separate hyperparameter. As well, we can learn kernel hyperparameters individually 
per input dimension, allowing automatic relevance determination to operate. 

4.3 Non-local Interactions 

By far the most popular kernels for regression and classification tasks on continuous data are 
the squared exponential (Gaussian) kernel, and the Matern kernels. These kernels depend 
only on the scaled Euclidean distance between two points, both having the form: /c(x, x') = 

fiY^d^i i^d — x'df /^d)' B^ngio et al. 1 13 1 argue that models based on squared-exponential kernels 
are particularity susceptible to the curse of dimensionality. They emphasize that the locality of the 
kernels means that these models cannot capture non-local structure. They argue that many functions 
that we care about have such structure. Methods based solely on local kernels will require training 
examples at all combinations of relevant inputs. 



Iff il • ilQ 

1st order interactions 2nd order interactions 3rd order interactions All interactions 

^1 + ^2 + ^3 kik2 ^ k2ks ^ kiks kik2ks 

(Squared-exp kernel) (Additive kernel) 

Figure 4: Isocontours of additive kernels in 3 dimensions. The third-order kernel only considers 
nearby points relevant, while the lower-order kernels allow the output to depend on distant points, 
as long as they share one or more input value. 

Additive kernels have a much more complex structure, and allow extrapolation based on distant 
parts of the input space, without spreading the mass of the kernel over the whole space. For example, 
additive kernels of the second order allow strong non-local interactions between any points which are 
similar in any two input dimensions. Figure]?] provides a geometric comparison between squared- 
exponential kernels and additive kernels in 3 dimensions. 
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5 Experiments 



5.1 Synthetic Data 



Because additive kernels can discover non-local structure in data, they are exceptionally well-suited 
to problems where local interpolation fails. Figure[5]shows a dataset which demonstrates this feature 




True Function Squared-exp GP Additive GP Additive GP 

& data locations posterior mean posterior mean Ist-order functions 



Figure 5: Long-range inference in functions with additive structure. 

of additive GPs, consisting of data drawn from a sum of two axis-aligned sine functions. The 
training set is restricted to a small, L-shaped area; the test set contains a peak far from the training 
set locations. The additive GP recovered both of the original sine functions (shown in green), and 
inferred correctly that most of the variance in the function comes from first-order interactions. The 
ability of additive GPs to discover long-range structure suggests that this model may be well-suited 
to deal with covariate- shift problems. 

5.2 Experimental Setup 

On a diverse collection of datasets, we compared five different models. In the results tables below, 
GP Additive refers to a GP using the additive kernel with squared-exp base kernels. For speed, 
we limited the maximum order of interaction in the additive kernels to 10. GP-GAM denotes an 
additive GP model with only first-order interactions. GP Squared-Exp is a GP model with a squared- 
exponential ARD kernel. HKlj^was run using the all-subsets kernel, which corresponds to the same 
set of kernels as considered by the additive GP with a squared-exp base kernel. 

For all GP models, we fit hyperparameters by the standard method of maximizing training-set 
marginal likelihood, using L-BFGS |T4| for 500 iterations, allowing five random restarts. In addition 
to learning kernel hyperparameters, we fit a constant mean function to the data. In the classification 
experiments, GP inference was done using Expectation Propagation y_5J. 

5.3 Results 

Tables [2j [3] |4] and [5] show mean performance across 10 train-test splits. Because HKL does not 
specify a noise model, it could not be included in the likelihood comparisons. 



Table 2: Regression Mean Squared Error 



Method 


bach 


concrete 


pumadyn-8nh 


servo 


housing 


Linear Regression 


L031 


0.404 


0.641 


0.523 


0.289 


GP GAM 


L259 


0.149 


0.598 


0.281 


0.161 


HKL 


0.199 


0.147 


0.346 


0.199 


0.151 


GP Squared-exp 


0.045 


0.157 


0.317 


0.126 


0.092 


GP Additive 


0.045 


0.089 


0.316 


0.110 


0.102 



The model with best performance on each dataset is in bold, along with all other models that were 
not significantly different under a paired t-test. The additive model never performs significantly 
worse than any other model, and sometimes performs significantly better than all other models. The 

^Code for HKL available at http : / /www . di . ens . f r/ ~ f bach/hkl/ 
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Table 3: Regression Negative Log Likelihood 



Method 


bach 


concrete 


pumadyn-8nh 


servo 


housing 


Linear Regression 


2.430 


L403 


L881 


L678 


L052 


GP GAM 


L708 


0.467 


L195 


0.800 


0.457 


GP Squared-exp 


-0.131 


0.398 


0.843 


0.429 


0.207 


GP Additive 


-0.131 


0.114 


0.841 


0.309 


0.194 



Table 4: Classification Percent Error 



Method 


breast 


pima 


sonar 


ionosphere 


liver 


heart 


Logistic Regression 


7.611 


24.392 


26.786 


16.810 


45.060 


16.082 


GP GAM 


5.189 


22.419 


15.786 


8.524 


29.842 


16.839 


HKL 


5.377 


24.261 


21.000 


9.119 


27.270 


18.975 


GP Squared-exp 


4.734 


23.722 


16.357 


6.833 


31.237 


20.642 


GP Additive 


5.566 


23.076 


15.714 


7.976 


30.060 


18.496 



Table 5: Classification Negative Log Likelihood 



Method 


breast 


pima 


sonar 


ionosphere 


liver 


heart 


Logistic Regression 


0.247 


0.560 


4.609 


0.878 


0.864 


0.575 


GP GAM 


0.163 


0.461 


0.377 


0.312 


0.569 


0.393 


GP Squared-exp 


0.146 


0.478 


0.425 


0.236 


0.601 


0.480 


GP Additive 


0.150 


0.466 


0.409 


0.295 


0.588 


0.415 



difference between all methods is larger in the case of regression experiments. The performance of 
HKL is consistent with the results in ||4|, performing competitively but slightly worse than SE-GP. 

The additive GP performed best on datasets well-explained by low orders of interaction, and ap- 
proximately as well as the SE-GP model on datasets which were well explained by high orders of 
interaction (see table [T]). Because the additive GP is a superset of both the GP-GAM model and the 
SE-GP model, instances where the additive GP performs slightly worse are presumably due to over- 
fitting, or due to the hyperparameter optimization becoming stuck in a local maximum. Additive GP 
performance can be expected to benefit from integrating out the kernel hyperparameters. 



6 Conclusion 



We present additive Gaussian processes: a simple family of models which generalizes two widely- 
used classes of models. Additive GPs also introduce a tractable new type of structure into the GP 
framework. Our experiments indicate that such additive structure is present in real datasets, allowing 
our model to perform better than standard GP models. In the case where no such structure exists, 
our model can recover arbitrarily flexible models, as well. 

In addition to improving modeling efficacy, the additive GP also improves model interpretability: 
the order variance hyperparameters indicate which sorts of structure are present in our model. 

Compared to HKL, which is the only other tractable procedure able to capture the same types of 
structure, our method benefits from being able to learn individual kernel hyperparameters, as well 
as the weightings of different orders of interaction. Our experiments show that additive GPs are a 
state-of-the-art regression model. 
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